Lesson 4


Scatterplots and Perceived Audience Size

Notes: 用户低估来信息接受人数 ***

Scatterplots

Notes:对于连续的变量,要探究其相关性时,散点图是最好的选择

pf <- read.csv("../lesson3/pseudo_facebook.tsv", sep="\t")
library(ggplot2)
qplot(x=age, y=friend_count, data=pf)

ggplot(aes(x=age, y=friend_count), data=pf) +
  geom_point()


What are some things that you notice right away?

Response: 多点过于集中,年龄可能存在虚假报告 ***

ggplot Syntax

Notes:

ggplot(aes(x=age, y=friend_count), data=pf) +
  geom_point(alpha=0.05) +
  xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).


Overplotting

Notes:因为点重叠无法真是反应出年龄和朋友数量的关系(因为年龄是连续性书之后),可以使用jitter方法来人为增加noise,以增强显示点。下图就是使用jitter后展现效果

ggplot(aes(x=age, y=friend_count), data=pf) +
  geom_jitter(alpha=0.05) +
  xlim(13, 90)
## Warning: Removed 5159 rows containing missing values (geom_point).

What do you notice in the plot?

Response:在60岁左右的用户的朋友数具有明显的数量多;即使是年轻人朋友数量也主要集中在2000人以下


Coord_trans()

Notes:[coor_trans文档](http://ggplot2.tidyverse.org/reference/coord_trans.html) is different to scale transformations in that it occurs after statistical transformation and will affect the visual appearance of geoms - there is no guarantee that straight lines will continue to be straight.

示例: coord_trans(x = “identity”, y = “identity”, limx = NULL, limy = NULL, xtrans, ytrans)

ggplot(aes(x=age, y=friend_count), data=pf) +
  geom_point(alpha=0.05) +
  xlim(13, 90)+
  coord_trans(y="sqrt")
## Warning: Removed 4906 rows containing missing values (geom_point).

Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!

在使用jitter是需要注意,如果直接使用jitter的话,可能因为点在0上使用sqrt之后而出现虚数,为了避免该情况,使用position参数来指定针对age进行抖动。下图是使用position之后的图,可以看出没有出现y轴0以下的数。关于position_jitter解释

ggplot(aes(x=age, y=friend_count), data=pf) +
  geom_point(alpha=0.05, position=position_jitter(h=0)) +
  xlim(13, 90)+
  coord_trans(y="sqrt")
## Warning: Removed 5184 rows containing missing values (geom_point).

What do you notice?


Alpha and Jitter

Notes:

ggplot(aes(x=age, y=friendships_initiated), data=pf) +
  geom_jitter(alpha=0.05) +
  xlim(13, 90)
## Warning: Removed 5193 rows containing missing values (geom_point).

ggplot(aes(x=age, y=friendships_initiated), data=pf) +
  geom_point(alpha=0.05, position=position_jitter(h=0)) +
  coord_trans(y="sqrt")+
  xlim(13, 90)
## Warning: Removed 5174 rows containing missing values (geom_point).


Overplotting and Domain Knowledge

Notes:


Conditional Means

Notes:使用dplyr库,进行相关处理,注意函数n不需要使用参数——它是直接用于结算数量的

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
pf.fc_by_age <- pf %>% group_by(age) %>% 
  summarise(mean=mean(friend_count), median=median(friend_count), n=n()) %>% 
  arrange(age)
pf.fc_by_age
## # A tibble: 101 x 4
##      age     mean median     n
##    <int>    <dbl>  <dbl> <int>
##  1    13 164.7500   74.0   484
##  2    14 251.3901  132.0  1925
##  3    15 347.6921  161.0  2618
##  4    16 351.9371  171.5  3086
##  5    17 350.3006  156.0  3283
##  6    18 331.1663  162.0  5196
##  7    19 333.6921  157.0  4391
##  8    20 283.4991  135.0  3769
##  9    21 235.9412  121.0  3671
## 10    22 211.3948  106.0  3032
## # ... with 91 more rows

Create your plot!

ggplot(aes(x=age, y=mean), data=pf.fc_by_age) +
  geom_line() +
  scale_x_continuous(limits=c(13,90))
## Warning: Removed 23 rows containing missing values (geom_path).


Overlaying Summaries with Raw Data

Notes:如果需要在图形上添加摘要信息,可以使用相关的图形,并且加入相关的统计信息参数,如geom_line(stat=“summary”, fun.y=mean)——其中stat表示统计方法,fun.y针对的是使用统计的函数

ggplot(aes(age, friend_count), data=pf) +
  coord_cartesian(xlim=c(13,90), ylim=c(0,2000)) +
  geom_point(alpha=0.05, position=position_jitter(h=0), color="orange") +
  coord_trans(y="sqrt") +
  geom_line(stat="summary", fun.y=mean) +
  geom_line(stat="summary", fun.y=quantile, fun.args=list(probs=.1), linetype=2, color="blue") +
  geom_line(stat="summary", fun.y=quantile, fun.args=list(probs=.9), linetype=2, color="blue") +
  geom_line(stat="summary", fun.y=quantile, fun.args=list(probs=.5), color="blue")

What are some of your observations of the plot?

Response: 年龄越大朋友数量反而会表现增大的趋势,所有的朋友数都集中在1000以下 ***

Moira: Histogram Summary and Scatterplot

See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

Notes:


Correlation

Notes:

with(pf, cor(age, friend_count))
## [1] -0.02740737

Look up the documentation for the cor.test function.

What’s the correlation between age and friend count? Round to three decimal places. Response:


Correlation on Subsets

Notes: with方法的使用是不仅减少了数据输入,而且加快运行速度

with(subset(pf, age<=70), cor.test(age, friend_count))
## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1780220 -0.1654129
## sample estimates:
##        cor 
## -0.1717245

Correlation Methods

Notes:


with(subset(pf, age<=70), cor.test(age, friend_count, method="spearman"))
## Warning in cor.test.default(age, friend_count, method = "spearman"): Cannot
## compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  age and friend_count
## S = 1.5782e+14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.2552934

Create Scatterplots

Notes:

ggplot(aes(x=www_likes_received, y=likes_received), data=pf) + 
  geom_point(alpha=.05) +
  coord_cartesian(ylim=c(0,10000), xlim=c(0, 6000))


Strong Correlations

Notes:

ggplot(aes(x=www_likes_received, y=likes_received), data=pf) +
  geom_point() +
  xlim(0, quantile(pf$www_likes_received, .95)) +
  ylim(0, quantile(pf$likes_received, 0.95)) +
  geom_smooth(method="lm", color="red")
## Warning: Removed 6075 rows containing non-finite values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).

What’s the correlation betwen the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.

with(pf, cor.test(www_likes_received, likes_received), method="pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  www_likes_received and likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9473553 0.9486176
## sample estimates:
##       cor 
## 0.9479902

Response:


Moira on Correlation

Notes: 相关性对于探究多变量之间关系具有重要且直观的意义,可以推动是否需要继续研究两者之间关系。但是需要注意,直接使用相关性探究可能得到的数值是具有强相关性,但是在推荐使用scatter plot来继续查看是否具有某种相关性。 ***

More Caution with Correlation

Notes:

library(alr3)
## Loading required package: car
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode

Create your plot!

ggplot(aes(Month, Temp), data=Mitchell) +
 geom_point()

with(Mitchell, cor.test(Month, Temp))
## 
##  Pearson's product-moment correlation
## 
## data:  Month and Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063

Noisy Scatterplots

  1. Take a guess for the correlation coefficient for the scatterplot.

  2. What is the actual correlation of the two variables? (Round to the thousandths place)


Making Sense of Data

Notes:

ggplot(aes(Month, Temp), data=Mitchell) +
  geom_point() +
  scale_x_continuous(breaks=seq(0, max(Mitchell$Month), 11)) +
  geom_path()


A New Perspective

What do you notice? Response:从时间周期来看,一个人的年龄还可以延伸到多少岁几个月——因为具有时间周期性,所以接下来将使用增加月份来探究

Watch the solution video and check out the Instructor Notes! Notes:因为年龄里面的线不是平滑的,猜测具有噪点,因此通过对年龄调整来把年龄数据细化在探索


Understanding Noise: Age to Age Months

Notes:

pf$age_with_month <- with(pf, age + (1 - dob_month/12))

Age with Months Means

pf.fc_by_age_months <- pf %>% 
  group_by(age_with_month) %>% 
  summarise(mean=mean(friend_count), median=median(friend_count), n=n()) %>% 
  arrange(age_with_month)
head(pf.fc_by_age_months)
## # A tibble: 6 x 4
##   age_with_month      mean median     n
##            <dbl>     <dbl>  <dbl> <int>
## 1       13.16667  46.33333   30.5     6
## 2       13.25000 115.07143   23.5    14
## 3       13.33333 136.20000   44.0    25
## 4       13.41667 164.24242   72.0    33
## 5       13.50000 131.17778   66.0    45
## 6       13.58333 156.81481   64.0    54

Programming Assignment

ggplot(aes(age_with_month, mean), data=pf.fc_by_age_months) +
  geom_line() +
  xlim(13, 70) +
  ylim(0, 500)
## Warning: Removed 511 rows containing missing values (geom_path).


Noise in Conditional Means

针对

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
p1 <- ggplot(aes(x=age, y=mean), data=subset(pf.fc_by_age, age<71)) +
  geom_line() #只探索0-70岁的用户
p2 <- ggplot(aes(age_with_month, mean), data=subset(pf.fc_by_age_months, age_with_month<71)) +
  geom_line() #增加噪点研究年龄上的平均朋友数变化
grid.arrange(p2, p1, ncol=1)


减小年龄间的范围,参考方法是使用局部回归思路,直观图解见文档 ### Smoothing Conditional Means Notes:

library(gridExtra)
p1 <- ggplot(aes(age, mean), data=subset(pf.fc_by_age, age<71)) +
  geom_line() #只探索0-70岁的用户
p2 <- ggplot(aes(age_with_month, mean), data=subset(pf.fc_by_age_months, age_with_month<71)) +
  geom_line() #增加噪点研究年龄上的平均朋友数变化
p3 <- ggplot(aes(round(age/5)*5, friend_count), data=subset(pf, age<71))+
  geom_line(stat="summary", fun.y=mean) #通过更改用户年龄来改变绘图是用到的年龄段,减少noise——将5岁年龄内的数据归为一类
grid.arrange(p2, p1, p3, ncol=1)

library(gridExtra)
p1 <- ggplot(aes(age, mean), data=subset(pf.fc_by_age, age<71)) +
  geom_line() +
  geom_smooth() #只探索0-70岁的用户
p2 <- ggplot(aes(age_with_month, mean), data=subset(pf.fc_by_age_months, age_with_month<71)) +
  geom_line() +
  geom_smooth() #增加噪点研究年龄上的平均朋友数变化
p3 <- ggplot(aes(round(age/5)*5, friend_count), data=subset(pf, age<71))+
  geom_line(stat="summary", fun.y=mean) #通过更改用户年龄来改变绘图是用到的年龄段,减少noise——将5岁年龄内的数据归为一类
grid.arrange(p2, p1, p3, ncol=1)
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'

从上图看出,在第二种方案中更容易产生系统误差 ***

Which Plot to Choose?

Notes: 数据折衷是多种方案的选择,不适单独某个方案的能够确定关系。所以数据分析时,最佳方案时选择多个方案进行分析 ***

Analyzing Two Variables

Reflection:散点图,相关系数,模型判断,jitter,多方案折衷


Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!